AT-ST: Self-training Adaptation Strategy for OCR in Domains with Limited Transcriptions
Abstract
This paper addresses text recognition for domains with limited manual annotations by a simple self-training strategy. Our approach should reduce human annotation effort when target domain data is plentiful, such as when transcribing a collection of a single person's correspondence or a large manuscript. We propose to train a seed system on large-scale data from related domains mixed with the available annotated data from the target domain. The seed system transcribes the unannotated data from the target domain, which is then used to train a better system. We study several confidence measures and eventually decide to use the posterior probability of a transcription for data selection. Additionally, we propose to augment the data using an aggressive masking scheme. By self-training, we achieve up to 55 % reduction in character error rate for handwritten data and up to 38 % on printed data. The masking augmentation itself reduces the error rate by about 10 %, and its effect is more pronounced in the case of difficult handwritten data.
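The two ingredients named in the abstract, confidence-based selection of machine transcriptions and masking augmentation, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not the authors' implementation: the function names, the length-normalized form of the posterior, the `keep_fraction` cut-off, and the choice of masking random vertical strips of a line image.

```python
import math
import random


def transcription_confidence(char_log_probs):
    """Length-normalized posterior of a transcription (assumed form):
    the geometric mean of the per-character probabilities."""
    if not char_log_probs:
        return 0.0
    return math.exp(sum(char_log_probs) / len(char_log_probs))


def select_confident(hypotheses, keep_fraction=0.5):
    """Keep the most confident fraction of machine transcriptions.

    hypotheses: list of (line_id, text, char_log_probs) tuples produced
    by a seed recognizer (hypothetical data layout).
    """
    scored = sorted(
        hypotheses,
        key=lambda h: transcription_confidence(h[2]),
        reverse=True,
    )
    n_keep = int(len(scored) * keep_fraction)
    return [(line_id, text) for line_id, text, _ in scored[:n_keep]]


def mask_columns(image, n_masks=2, max_width=3, fill=0, rng=None):
    """Return a copy of a 2-D image (list of pixel rows) in which random
    vertical strips are overwritten with `fill`."""
    rng = rng or random.Random()
    width = len(image[0])
    out = [row[:] for row in image]  # leave the input untouched
    for _ in range(n_masks):
        w = rng.randint(1, max_width)
        start = rng.randint(0, max(0, width - w))
        for row in out:
            for x in range(start, min(width, start + w)):
                row[x] = fill
    return out
```

In a self-training loop of the kind the abstract describes, the lines returned by `select_confident` would be added to the annotated data, and `mask_columns` would be applied to line images before retraining.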
Similar resources
Using game theory techniques in self-organizing maps training
The self-organizing map is the most widely used neural network for clustering and vector quantization. Since its introduction, the method has been applied to a variety of problems in many fields, and numerous extensions and improvements have been proposed for it. The self-organizing map uses a number of cells to estimate the distribution function of the input patterns in a multidimensional space. The possible presence of dead cells is considered a fundamental problem in the self-organizing map algorithm...
OCR with No Shape Training
We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a “clump” metric, typically yields several hundred clusters with highly skewed populations. Letter identities are assigned to each cluster by maximizing matches with a lexicon of English words. We found that for 2/3 of t...
Unsupervised Arabic Dialect Adaptation with Self-Training
Useful training data for automatic speech recognition systems of colloquial speech is usually limited to expensive in-domain transcription. Broadcast news is an appealing source of easily available data to bootstrap into a new dialect. However, some languages, like Arabic, have deep linguistic differences resulting in poor cross domain performance. If no in-domain transcripts are available, but...
OCR of handwritten transcriptions of Ancient Egyptian hieroglyphic text
Encoding hieroglyphic texts is time-consuming. If a text already exists as hand-written transcription, there is an alternative, namely OCR. Off-the-shelf OCR systems seem difficult to adapt to the peculiarities of Ancient Egyptian. Presented is a proof-of-concept tool that was designed to digitize texts of Urkunden IV in the hand-writing of Kurt Sethe. It automatically recognizes signs and prod...
Improving Lightly Supervised Training for Broadcast Transcriptions
This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses automatically derived decoding hypotheses using a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2021
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-030-86337-1_31